Background

This module extends code contained in Coronavirus_Statistics_v003.Rmd to include 2020-2021 data. This file includes the latest code for analyzing all-cause death data from CDC Weekly Deaths by Jurisdiction. CDC maintains data on deaths by week, age cohort, and state in the US. Downloaded data are unique by state, epidemiological week, year, age, and type (actual vs. predicted/projected).

These data are known to have a lag between death and reporting, and the CDC back-correct to report deaths at the time the death occurred even if the death is reported in following weeks. This means totals for recent weeks tend to run low (lag), and the CDC run a projection of the expected total number of deaths given the historical lag times. Per other analysts on the internet, there is currently significant supra-lag, with lag times much longer than historical averages causing CDC projected deaths for recent weeks to be low.

Functions for the previous module are available in Coronavirus_Statistics_Functions_CDC_v003.R and Coronavirus_Statistics_Functions_Shared_v003.R. The appropriate subset of these functions are copied and edited for use in this module. The code leverages tidyverse and a variable mapping file throughout:

# All functions assume that tidyverse and its components are loaded and available
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.1.1     v dplyr   1.0.6
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# If the same function is in both files, use the version from the more specific source
source("./Generic_Added_Utility_Functions_202105_v001.R")
# source("./Coronavirus_Statistics_Functions_USAF_v003.R")

Running Code

The main function is readRunCDCAllCause(), which performs multiple tasks:

STEP 0: Optionally, downloads the latest data file from CDC STEP 1: Reads and processes a data file has been downloaded from CDC to local
STEP 2: Extract relevant data from a processed state-level COVID Tracking Project list
STEP 3: Basic plots of the CDC data
STEP 4: Basic excess-deaths analysis
STEP 5: Create cluster-level aggregate plots
STEP 6: Create state-level aggregate plots
STEP 7: Create age-cohort aggregate plots
STEP 8: Returns a list of key data frames, modeling objects, named cluster vectors, etc.

Modifications: Main function

The main function to be edited is readRunCDCAllCause():

# Function to read and run the CDC all-cause deaths analysis
readRunCDCAllCause <- function(loc, 
                               startYear, 
                               curYear,
                               weekThru, 
                               startWeek, 
                               lst, 
                               cvDeathThru,
                               epiMap=readFromRDS("epiMonth"),
                               agePopData=readFromRDS("usPopBucket2020"),
                               cdcPlotStartWeek=startWeek,
                               periodKeep=paste0(startYear, "-", curYear-1), 
                               url="https://data.cdc.gov/api/views/y5bj-9g5w/rows.csv?accessType=DOWNLOAD",
                               dlData=isFALSE(file.exists(paste0(dir, loc))), 
                               dir="./RInputFiles/Coronavirus/", 
                               stateNoCheck=c()
                               ) {
    
    # FUNCTION ARGUMENTS:
    # loc: the CDC .csv file name (without path)
    # startYear: the starting year in the CDC data
    # curYear: the current analyis year in the CDC data
    # weekThru: how many weeks of the current year are the data valid thru?
    # startWeek: the starting week to use for cumulative sum of difference in expected all-cause deaths
    # lst: a state clustering process output list
    # epiMap: a mapping file of ew-month-quarter that mas epiweek (ew) to an appropriate month and quarter
    # agePopData: data containing US population as age (fct) - pop (int)
    # cvDeathThru: the date to use for pulling the CV death data
    # cdcPlotStartWeek: start week for CDC plots (10 is March which avoids a 1-week February outlier)
    # periodKeep: the period of previous data in the CDC all-cause deaths file (should be kept regardless)
    # url: the url for downloading CDC data
    # dlData: should CDC data be downloaded?  default is to not overwrite an existing file
    # dir: the CDC .csv directory (will use paste0(dir, loc) as the file location)
    # stateNoCheck: vector of states that should not error out for failing suppression checks
    
    # STEP 0: Download CDC data if requested
    if (dlData) fileDownload(fileName=paste0(dir, loc), url=url)
    
    return()
    
    # STEP 1: Read and process the CDC data
    cdc <- readProcessCDC(loc, weekThru=weekThru, periodKeep=periodKeep, fDir=dir, stateNoCheck=stateNoCheck)
    
    # STEP 2: Create the key data required for using state-level clusters
    clusterList <- helperKeyStateClusterMetrics(lst)
    
    # STEP 3: Generate plots of the processed CDC data
    cdcBasicPlots(cdc, clustVec=clusterList$clData, stateExclude=stateNoCheck)

    # Exclude stateNoCheck states
    cat("\nPlots will be run after excluding stateNoCheck states\n")
    cdcUse <- cdc %>% filter(!(state %in% stateNoCheck))
    
    # STEP 4: Full US excess deaths
    list_allUS <- cdcCohortAnalysis(cohortName="all US", 
                                    df=cdcUse,
                                    curYear=curYear, 
                                    startYear=startYear,
                                    startWeek=startWeek,
                                    plotTitle="All-cause US total deaths",
                                    predActualPlotsOnePage=TRUE
    )
    # cat("\n\n\n*** THROUGH STEP 4 ***\n\n\n")
    
    # STEP 5: Generate cluster-level aggregate plots
    clusterAgg <- cdcAggregateSummary(df=cdcUse, 
                                      critVar="state", 
                                      critSubsets=clusterList$stateCluster,
                                      startWeek=startWeek, 
                                      critListNames=paste0("cluster ", 1:length(clusterList$stateCluster)),
                                      factorCritList=FALSE,
                                      popData=clusterList$pop,
                                      cvDeathData=clusterList$deaths,
                                      idVarName="cluster"
    )
    # cat("\n\n\n*** THROUGH STEP 5 ***\n\n\n")
    
    # STEP 6: Generate state-level aggregate data, then plot
    stateAgg <- cdcAggregateSummary(df=cdcUse, 
                                    critVar="state", 
                                    critSubsets=setdiff(names(clusterList$clData), stateNoCheck),
                                    startWeek=startWeek, 
                                    idVarName="state", 
                                    subListNames=setdiff(names(clusterList$clData), stateNoCheck),
                                    showAllPlots=FALSE
    )
    # cat("\n\n\n*** THROUGH STEP 6a ***\n\n\n")
    
    helperKeyStateExcessPlots(df=stateAgg, 
                              epiMonth=epiMap,
                              cvDeaths=lst$consolidatedPlotData,
                              startWeek=cdcPlotStartWeek,
                              cvDeathDate=as.Date(cvDeathThru),
                              subT=paste0("CDC data through week ", weekThru, " of ", curYear)
    )
    # cat("\n\n\n*** THROUGH STEP 6b ***\n\n\n")
    
    # STEP 7: Generate age-level aggregate data, then plot
    ageAgg <- cdcAggregateSummary(df=cdcUse, 
                                  critVar="age", 
                                  critSubsets=levels(cdc$age),
                                  startWeek=startWeek, 
                                  idVarName="age", 
                                  subListNames=levels(cdc$age),
                                  showAllPlots=TRUE
    )
    helperKeyAgeExcessPlots(df=ageAgg, 
                            epiMonth=epiMap,
                            cvDeaths=lst$consolidatedPlotData,
                            popData=agePopData,
                            startWeek=cdcPlotStartWeek,
                            cvDeathDate=as.Date(cvDeathThru),
                            subT=paste0("CDC data through week ", weekThru, " of ", curYear)
    )
    
    # STEP 8: Return a list of key files
    list(cdc=cdc, 
         clusterList=clusterList, 
         allUSAgg=list_allUS$preds,
         clusterAgg=clusterAgg, 
         stateAgg=stateAgg, 
         ageAgg=ageAgg
    )
    
}

Modifications: Additional functions

The data are initially run to download the latest CDC data:

cdcLoc <- "Weekly_counts_of_deaths_by_jurisdiction_and_age_group_downloaded_20210623.csv"
readRunCDCAllCause(loc=cdcLoc)
## NULL

The data are inspected to check that they are sufficient for the next steps:

remapVars=c('Jurisdiction'='fullState', 
            'Week Ending Date'='weekEnding', 
            'State Abbreviation'='state', 
            'Age Group'='age', 
            'Number of Deaths'='deaths', 
            'Time Period'='period', 
            'Year'='year', 
            'Week'='week'
            )

cdcRaw <- fileRead(paste0("./RInputFiles/Coronavirus/", cdcLoc), col_types="cccddcdcccc") %>%
    colRenamer(vecRename=remapVars)

cdcRaw
## # A tibble: 197,545 x 11
##    fullState weekEnding state  year  week age    deaths period  Type    Suppress
##    <chr>     <chr>      <chr> <dbl> <dbl> <chr>   <dbl> <chr>   <chr>   <chr>   
##  1 Alabama   01/10/2015 AL     2015     1 25-44~     67 2015-2~ Predic~ <NA>    
##  2 Alabama   01/17/2015 AL     2015     2 25-44~     49 2015-2~ Predic~ <NA>    
##  3 Alabama   01/24/2015 AL     2015     3 25-44~     55 2015-2~ Predic~ <NA>    
##  4 Alabama   01/31/2015 AL     2015     4 25-44~     59 2015-2~ Predic~ <NA>    
##  5 Alabama   02/07/2015 AL     2015     5 25-44~     47 2015-2~ Predic~ <NA>    
##  6 Alabama   02/14/2015 AL     2015     6 25-44~     59 2015-2~ Predic~ <NA>    
##  7 Alabama   02/21/2015 AL     2015     7 25-44~     41 2015-2~ Predic~ <NA>    
##  8 Alabama   02/28/2015 AL     2015     8 25-44~     47 2015-2~ Predic~ <NA>    
##  9 Alabama   03/07/2015 AL     2015     9 25-44~     59 2015-2~ Predic~ <NA>    
## 10 Alabama   03/14/2015 AL     2015    10 25-44~     57 2015-2~ Predic~ <NA>    
## # ... with 197,535 more rows, and 1 more variable: Note <chr>
cdcRaw %>% 
    group_by(state, Type) %>% 
    summarize(deaths=sum(deaths, na.rm=TRUE)) %>%
    arrange(-deaths)
## `summarise()` has grouped output by 'state'. You can override using the `.groups` argument.
## # A tibble: 108 x 3
## # Groups:   state [54]
##    state Type                   deaths
##    <chr> <chr>                   <dbl>
##  1 US    Predicted (weighted) 18787183
##  2 US    Unweighted           18736023
##  3 CA    Predicted (weighted)  1804503
##  4 CA    Unweighted            1802307
##  5 TX    Predicted (weighted)  1367691
##  6 FL    Predicted (weighted)  1367661
##  7 FL    Unweighted            1366098
##  8 TX    Unweighted            1363538
##  9 PA    Predicted (weighted)   898425
## 10 PA    Unweighted             897361
## # ... with 98 more rows
cdcRaw %>%
    mutate(subUnit=ifelse(state=="US", "whole", "components")) %>%
    group_by(subUnit, Type, year) %>%
    summarize(deaths=sum(deaths, na.rm=TRUE), .groups="drop") %>%
    ggplot(aes(x=year)) + 
    geom_col(aes(y=deaths), fill="lightblue") + 
    geom_text(aes(y=deaths/2, label=paste0(round(deaths/1000000, 2), " MM")), size=3) + 
    facet_grid(Type~subUnit) + 
    labs(x=NULL, y="Deaths", title="Raw CDC Summed Deaths")

Back of the envelope, the data appear reasonable for use in refining code.

The readrProcessCDC() function is updated:

ageLevels <- c("Under 25 years", "25-44 years", "45-64 years", 
               "65-74 years", "75-84 years", "85 years and older"
               )
periodLevels <- c("2015-2019", "2020", "2021")
yearLevels <- 2015:2021

# Function to read and process raw CDC all-cause deaths data
readProcessCDC <- function(fName, 
                           weekThru,
                           periodKeep=c("2015-2019", "2020"),
                           fDir="./RInputFiles/Coronavirus/",
                           col_types="ccciicdcccc", 
                           renameVars=remapVars,  # Update this approach 
                           maxSuppressAllowed=10, 
                           stateNoCheck=c()
                           ) {
    
    # FUNCTION ARGUMENTS:
    # fName: name of the downloaded CDC data file
    # weekThru: any record where week is less than or equal to weekThru will be kept
    # periodKeep: any record where period is in periodKeep will be kept
    # fDir: directory name for the downloaded CDC data file
    # col_types: variable type by column in the CDC data (passed to readr::read_csv())
    # renameVars: named vector for variable renaming of type c("Existing Name"="New Name")
    # maxSuppressAllowed: maximum number of data suppressions (must be in current week/year) to avoid error
    # stateNoCheck: vector of states that do NOT have suppression errors thrown
    
    # STEP 1: Read the CSV data
    cdcRaw <- fileRead(paste0(fDir, fName), col_types=col_types)
    # glimpse(cdcRaw)
    
    # STEP 2: Rename the variables for easier interpretation
    cdcRenamed <- cdcRaw %>%
        colRenamer(vecRename=renameVars) %>%
        colMutater(selfList=list("weekEnding"=lubridate::mdy))
    # glimpse(cdcRenamed)
    
    # STEP 3: Convert to factored data
    cdcFactored <- cdcRenamed %>%
        colMutater(selfList=list("age"=factor), levels=ageLevels) %>%
        colMutater(selfList=list("period"=factor), levels=periodLevels) %>%
        colMutater(selfList=list("year"=factor), levels=yearLevels)
    # glimpse(cdcFactored)
    
    # STEP 4: Filter the data to include only weighted deaths and only through the desired time period
    cdcFiltered <- cdcFactored %>%
        rowFilter(lstFilter=list("Type"="Predicted (weighted)")) %>%
        filter(period %in% all_of(periodKeep) | week <= weekThru)
    # glimpse(cdcFiltered)
    
    # STEP 4a: Check that all suppressed data and NA deaths have been eliminated
    cat("\n\n *** Data suppression checks *** \n")
    checkProblems <- cdcFiltered %>% 
        mutate(problem=(!is.na(Suppress) | is.na(deaths)), 
               curWeek=(!(year %in% all_of(periodKeep)) & week==weekThru), 
               noCheck=state %in% all_of(stateNoCheck)
               ) %>%
        group_by(noCheck, state, problem, curWeek) %>%
        summarize(n=n(), deaths=specNA(sum)(deaths), .groups="drop")
    errorProblems <- checkProblems %>%
        filter(problem)
    print(errorProblems)
    errorProblems <- errorProblems %>%
        group_by(noCheck, curWeek) %>%
        summarize(n=sum(n), .groups="drop")
    print(errorProblems)
    nPrev <- errorProblems %>% filter(!noCheck, !curWeek) %>% pull(n) %>% sum()
    nCur <- errorProblems %>% filter(!noCheck, curWeek) %>% pull(n) %>% sum()
    if (nPrev > 0) stop(paste0("\n\n", nPrev, " records not from current week/year with problems.  Fix and retry\n"))
    if (nCur > maxSuppressAllowed) {
        cat("\n\n", nCur, " records from current week/year with problems, exceeds tolerance of ", maxSuppressAllowed)
        stop("\nFix and retry\n")
    }

    cat("\n\nData suppression checks passed\n\n")
    
    # STEP 5: Remove any NA death fields, delete the US record, convert YC to be part of NY
    cdcProcessed <- cdcFiltered %>%
        rowFilter(lstExclude=list("state"=c("US", "PR"), "deaths"=c(NA))) %>%
        mutate(state=ifelse(state=="YC", "NY", state), 
               fullState=ifelse(state %in% c("NY", "YC"), "New York State (NY plus YC)", fullState)
        ) %>%
        group_by(fullState, weekEnding, state, year, week, age, period, Type, Suppress) %>%
        arrange(!is.na(Note)) %>%
        summarize(n=n(), deaths=sum(deaths), Note=first(Note), .groups="drop") %>%
        ungroup() %>%
        checkUniqueRows(uniqueBy=c("state", "year", "week", "age"))
    glimpse(cdcProcessed)
    
    # STEP 5a: Check control levels for key variables in processed file
    cat("\nCheck Control Levels and Record Counts for Processed Data:\n")
    cdcCheck <- cdcProcessed %>% mutate(n=1, n_deaths_na=ifelse(is.na(deaths), 1, 0))
    purrr::walk(list(c("age"), c("period", "year", "Type"), c("period", "Suppress"), c("period", "Note")), 
                .f=function(x) {
                    cat("\n\nChecking variable combination:", x, "\n")
                    checkControl(cdcCheck, groupBy=x, useVars=c("n", "n_deaths_na", "deaths"), fn=specNA(sum))
                    }
                )
    p1 <- checkControl(cdcCheck, 
                       groupBy=c("state"), 
                       useVars=c("deaths"), 
                       fn=specNA(sum), 
                       printControls=FALSE, 
                       pivotData=FALSE
                       ) %>%
        ggplot(aes(x=fct_reorder(state, deaths), y=deaths)) + 
        geom_col(fill="lightblue") + 
        geom_text(aes(y=deaths, label=paste0(round(deaths/1000), "k")), hjust=0, size=3) + 
        coord_flip() +
        labs(y="Total deaths", x=NULL, title="Total deaths by state in all years in processed file")
    print(p1)
    p2 <- checkControl(cdcCheck, 
                       groupBy=c("year", "week"), 
                       useVars=c("deaths"), 
                       fn=specNA(sum), 
                       printControls=FALSE, 
                       pivotData=FALSE
                       ) %>%
        ggplot(aes(x=week, y=deaths)) + 
        geom_line(aes(group=year, color=year)) + 
        labs(title="Deaths by year and epidemiological week", x="Epi week", y="US deaths") + 
        scale_color_discrete("Year") + 
        lims(y=c(0, NA))
    print(p2)
    
    # STEP 6: Return the processed data file
    cdcProcessed
    
}

cdc_temp <- readProcessCDC(cdcLoc, weekThru=17, stateNoCheck=c("NC"))
## 
## 
##  *** Data suppression checks *** 
## # A tibble: 2 x 6
##   noCheck state problem curWeek     n deaths
##   <lgl>   <chr> <lgl>   <lgl>   <int>  <dbl>
## 1 TRUE    NC    TRUE    FALSE      72     NA
## 2 TRUE    NC    TRUE    TRUE        6     NA
## # A tibble: 2 x 3
##   noCheck curWeek     n
##   <lgl>   <lgl>   <int>
## 1 TRUE    FALSE      72
## 2 TRUE    TRUE        6
## 
## 
## Data suppression checks passed
## 
## 
## *** File has been checked for uniqueness by: state year week age 
## 
## Rows: 91,537
## Columns: 12
## $ fullState  <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Ala~
## $ weekEnding <date> 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10, 2015-01-10~
## $ state      <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",~
## $ year       <fct> 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015, 2015,~
## $ week       <int> 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4,~
## $ age        <fct> Under 25 years, 25-44 years, 45-64 years, 65-74 years, 75-8~
## $ period     <fct> 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015-2019, 2015~
## $ Type       <chr> "Predicted (weighted)", "Predicted (weighted)", "Predicted ~
## $ Suppress   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## $ n          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,~
## $ deaths     <dbl> 25, 67, 253, 202, 272, 320, 28, 49, 256, 222, 253, 332, 26,~
## $ Note       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,~
## 
## Check Control Levels and Record Counts for Processed Data:
## 
## 
## Checking variable combination: age 
## # A tibble: 6 x 4
##   age                    n n_deaths_na  deaths
##   <fct>              <dbl>       <dbl>   <dbl>
## 1 Under 25 years     10735           0  369164
## 2 25-44 years        13656           0  902390
## 3 45-64 years        16793           0 3549786
## 4 65-74 years        16783           0 3558139
## 5 75-84 years        16790           0 4401133
## 6 85 years and older 16780           0 5681860
## 
## 
## Checking variable combination: period year Type 
## # A tibble: 7 x 6
##   period    year  Type                     n n_deaths_na  deaths
##   <fct>     <fct> <chr>                <dbl>       <dbl>   <dbl>
## 1 2015-2019 2015  Predicted (weighted) 14364           0 2691180
## 2 2015-2019 2016  Predicted (weighted) 14445           0 2723236
## 3 2015-2019 2017  Predicted (weighted) 14404           0 2801986
## 4 2015-2019 2018  Predicted (weighted) 14400           0 2830372
## 5 2015-2019 2019  Predicted (weighted) 14415           0 2844025
## 6 2020      2020  Predicted (weighted) 14837           0 3433405
## 7 2021      2021  Predicted (weighted)  4672           0 1138268
## 
## 
## Checking variable combination: period Suppress 
## # A tibble: 3 x 5
##   period    Suppress     n n_deaths_na   deaths
##   <fct>     <chr>    <dbl>       <dbl>    <dbl>
## 1 2015-2019 <NA>     72028           0 13890799
## 2 2020      <NA>     14837           0  3433405
## 3 2021      <NA>      4672           0  1138268
## 
## 
## Checking variable combination: period Note 
## # A tibble: 9 x 5
##   period   Note                                            n n_deaths_na  deaths
##   <fct>    <chr>                                       <dbl>       <dbl>   <dbl>
## 1 2015-20~ <NA>                                        72028           0  1.39e7
## 2 2020     Data in recent weeks are incomplete. Only ~ 13194           0  2.96e6
## 3 2020     Data in recent weeks are incomplete. Only ~   531           0  2.31e5
## 4 2020     Weighted numbers of deaths are 20% or more~   280           0  6.00e4
## 5 2020     Weights may be too low to account for unde~    18           0  9.85e3
## 6 2020     <NA>                                          814           0  1.69e5
## 7 2021     Data in recent weeks are incomplete. Only ~  4469           0  1.10e6
## 8 2021     Data in recent weeks are incomplete. Only ~    14           0  9.65e2
## 9 2021     Data in recent weeks are incomplete. Only ~   189           0  3.58e4

cdc_temp
## # A tibble: 91,537 x 12
##    fullState weekEnding state year   week age      period Type    Suppress     n
##    <chr>     <date>     <chr> <fct> <int> <fct>    <fct>  <chr>   <chr>    <int>
##  1 Alabama   2015-01-10 AL    2015      1 Under 2~ 2015-~ Predic~ <NA>         1
##  2 Alabama   2015-01-10 AL    2015      1 25-44 y~ 2015-~ Predic~ <NA>         1
##  3 Alabama   2015-01-10 AL    2015      1 45-64 y~ 2015-~ Predic~ <NA>         1
##  4 Alabama   2015-01-10 AL    2015      1 65-74 y~ 2015-~ Predic~ <NA>         1
##  5 Alabama   2015-01-10 AL    2015      1 75-84 y~ 2015-~ Predic~ <NA>         1
##  6 Alabama   2015-01-10 AL    2015      1 85 year~ 2015-~ Predic~ <NA>         1
##  7 Alabama   2015-01-17 AL    2015      2 Under 2~ 2015-~ Predic~ <NA>         1
##  8 Alabama   2015-01-17 AL    2015      2 25-44 y~ 2015-~ Predic~ <NA>         1
##  9 Alabama   2015-01-17 AL    2015      2 45-64 y~ 2015-~ Predic~ <NA>         1
## 10 Alabama   2015-01-17 AL    2015      2 65-74 y~ 2015-~ Predic~ <NA>         1
## # ... with 91,527 more rows, and 2 more variables: deaths <dbl>, Note <chr>

The process now reads and processes the file and runs diagnostics for sanity checks.

The helperKeyStateClusterMetrics() function is also updated:

# Convert the output of a state-level clustering list to population, membership, and deaths
helperKeyStateClusterMetrics <- function(lst) {
    
    # FUNCTION ARGUMENTS:
    # lst: the list containing the outputs of the state clustering routing
    
    # Extract the relevant elements from lst
    plotClusters <- lst[["plotDataList"]][["plotClusters"]]
    dfFull <- lst[["plotDataList"]][["dfFull"]] %>% select(state, date, pop, tot_deaths, new_deaths)
    
    # Create the aggregated data
    df <- joinFrames(plotClusters, dfFull, keyJoin="state")
    
    # Create reported cvDeaths by cluster
    dfWeekly <- df %>%
        mutate(year=lubridate::epiyear(date), week=lubridate::epiweek(date), cluster=as.character(cluster)) %>%
        group_by(state, cluster, year, week) %>%
        summarize(pop=max(pop), 
                  tot_deaths=specNA(max)(tot_deaths), 
                  new_deaths=specNA(sum)(new_deaths), 
                  .groups="drop"
                  ) %>%
        group_by(cluster, year, week) %>%
        summarize(pop=sum(pop), 
                  tot_deaths=specNA(sum)(tot_deaths), 
                  new_deaths=specNA(sum)(new_deaths), 
                  .groups="drop"
                  ) %>%
        checkUniqueRows(uniqueBy=c("cluster", "year", "week"))
    
    list(deaths=dfWeekly, useClusters=lst[["useClusters"]])
    
}

# Get a state-level clustering file
cdc_daily_210528 <- readFromRDS("cdc_daily_210528")
clusterList_temp <- helperKeyStateClusterMetrics(cdc_daily_210528)
## 
## *** File has been checked for uniqueness by: cluster year week
clusterList_temp
## $deaths
## # A tibble: 512 x 6
##    cluster  year  week      pop tot_deaths new_deaths
##    <chr>   <dbl> <dbl>    <dbl>      <dbl>      <dbl>
##  1 3        2020     1  1415872         NA         NA
##  2 3        2020     2  1415872         NA         NA
##  3 3        2020     3  1415872         NA         NA
##  4 3        2020     4 16576414          0          0
##  5 3        2020     5 16576414          0          0
##  6 3        2020     6 16576414          0          0
##  7 3        2020     7 16576414          0          0
##  8 3        2020     8 16576414          0          0
##  9 3        2020     9 16576414          1          1
## 10 3        2020    10 16576414         16         15
## # ... with 502 more rows
## 
## $useClusters
## AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA MA MD ME MI MN MO MS 
##  4  9  9  9  6  4  8  5  5  6  9  3  7  4  7  7  7  6  9  8  5  3  5  4  4  9 
## MT NC ND NE NH NJ NM NV NY OH OK OR PA RI SC SD TN TX UT VA VT WA WI WV WY 
##  4  6  7  4  3  8  7  9  8  6  7  3  9  8  9  7  9  6  4  6  3  3  4  6  4
clusterList_temp$deaths %>%
    pivot_longer(-c("cluster", "year", "week")) %>%
    filter(name %in% c("tot_deaths", "new_deaths"), !is.na(value)) %>%
    ggplot(aes(x=week, y=value)) + 
    geom_line(aes(group=cluster, color=cluster)) + 
    facet_grid(name~year, scales="free_y") + 
    labs(x="Epi Week", y="Deaths (total or incremental)", title="Evolution of deaths by cluster")

Data appear to be usable for the next steps in CDC processing. The function cdcBasicPlots() is updated:

# Basic plots of the CDC data, including by a state-level cluster (passed as argument)
cdcBasicPlots <- function(df, 
                          weekThru=NULL, 
                          curYear=NULL, 
                          p5Years=curYear,
                          clustVec=NULL, 
                          stateExclude=c()
                          ) {
    
    # FUNCTION ARGUMENTS:
    # df: a processed CDC data file
    # weekThru: week of the current year that data are thru (NULL means infer from data)
    # curYear: current year (NULL means infer from data)
    # p5Years: years for plotting deaths by week/cohort (defaults to only current year)
    # clustVec: clustering vector with names as state abbreviations (NULL means no plots by cluster)
    # stateExclude: vector of states to exclude from the analysis
    
    # Get the week and year if passed as NULL
    if (is.null(curYear)) { curYear <- df %>% pull(year) %>% as.character() %>% as.integer() %>% max() }
    if (is.null(weekThru)) { weekThru <- df %>% filter(year==curYear) %>% pull(week) %>% max() }
    
    # Filter to exclude stateExclude
    df <- df %>%
        filter(!(state %in% stateExclude))
    
    # Update subtitle appropriately
    subT <- NULL
    if (length(stateExclude) > 0) subT <- paste0("Excludes ", paste0(stateExclude, collapse=", "))
    
    # Plot of total deaths by year (50 states plus DC)
    p1 <- df %>%
        group_by(year) %>%
        summarize(deaths=sum(deaths), .groups="drop") %>%
        ggplot(aes(x=year, y=deaths/1000)) + 
        geom_col(fill="lightblue") + 
        geom_text(aes(y=deaths/2000, label=round(deaths/1000))) + 
        labs(title=paste0("CDC Deaths for 50 states plus DC through week ", weekThru, " of ", curYear), 
             x="", 
             y="Deaths (000s)"
        )
    if (!is.null(subT)) p1 <- p1 + labs(subtitle=subT)
    print(p1)
    
    # Plot of total deaths by week by year
    p2 <- df %>%
        group_by(year, week) %>%
        summarize(deaths=sum(deaths), .groups="drop") %>%
        ggplot(aes(x=week, y=deaths, group=year, color=year)) + 
        geom_line() +
        ylim(c(0, NA)) + 
        labs(x="Week of Year", y="Deaths", title="US deaths per week by year")
    if (!is.null(subT)) p2 <- p2 + labs(subtitle=subT)
    print(p2)
    
    # Plot of total deaths by year by age cohort
    p3 <- df %>%
        group_by(year, week, age) %>%
        summarize(deaths=sum(deaths), .groups="drop") %>%
        ggplot(aes(x=week, y=deaths, group=year, color=year)) + 
        geom_line() +
        ylim(c(0, NA)) + 
        facet_wrap(~age) +
        labs(x="Week of Year", y="Deaths", title="US deaths per week by year")
    if (!is.null(subT)) p3 <- p3 + labs(subtitle=subT)
    print(p3)
    
    # Plots of total deaths by week by year by cluster
    if (!is.null(clustVec)) {
        p4 <- df %>%
            mutate(cluster=clustVec[state]) %>%
            group_by(year, week, cluster) %>%
            summarize(deaths=sum(deaths), .groups="drop") %>%
            ggplot(aes(x=week, y=deaths, group=year, color=year)) + 
            geom_line() +
            ylim(c(0, NA)) + 
            facet_wrap(~cluster, scales="free_y") +
            labs(x="Week of Year", 
                 y="Deaths", 
                 title="US deaths per week by year"
            )
        if (is.null(subT)) p4 <- p4 + labs(subtitle="Facetted by state cluster")
        else p4 <- p4 + labs(subtitle=paste0("Facetted by state cluster\n", subT))
        print(p4)
        
        p5 <- df %>%
            mutate(cluster=clustVec[state]) %>%
            filter(year %in% p5Years) %>%
            group_by(year, age, week, cluster) %>%
            summarize(deaths=sum(deaths), .groups="drop") %>%
            mutate(weekUse=week+53*(as.integer(as.character(year))-curYear)) %>%
            ggplot(aes(x=weekUse, y=deaths, fill=age)) + 
            geom_col(position="stack") +
            ylim(c(0, NA)) + 
            facet_wrap(~cluster, scales="free_y") +
            labs(x="Week of Year", 
                 y="Deaths", 
                 title=paste0("US deaths per week (Week 1 is start of year ", curYear, ")")
            )
        if (is.null(subT)) p5 <- p5 + labs(subtitle="Facetted by state cluster")
        else p5 <- p5 + labs(subtitle=paste0("Facetted by state cluster\n", subT))
        print(p5)
    }
    
}


cdcBasicPlots(cdc_temp, clustVec=clusterList_temp$useClusters, stateExclude="NC", p5Years=2019:2021)

The function cdcCohortAnalysis() is updated:

# Function to subset CDC data
subsetCDC <- function(df, 
                      critFilter=vector("list", 0), 
                      showPlot=TRUE, 
                      plotTitle=NULL, 
                      curYear=2020:2021, 
                      prevYears=paste0("2015-", min(curYear)-1)
                      ) {
    
    # FUNCTION ARGUMENTS:
    # df: the data frame containing the processed CDC data
    # critFilter: a filtering list of the form variable=c(possibleValues)
    # showPlot: boolean, whether to show plots of deaths by week and year
    # plotTitle: the title to be used for the plot (NULL will use a default title)
    # curYear: the current year (will be excluded from the mean and standard deviation plot)
    
    # Filter such that only matches to critFilter are included
    for (xNum in seq_len(length(critFilter))) {
        df <- df %>%
            filter_at(vars(all_of(names(critFilter)[xNum])), ~. %in% critFilter[[xNum]])
    }
    
    # Create a default title if none provided
    if (is.null(plotTitle)) plotTitle <- "All-cause deaths for filtered cohort"
    
    # All deaths by week and year for the filtered cohort
    allDeath <- df %>%
        group_by(year, week) %>%
        summarize(deaths=sum(deaths), .groups="drop") %>%
        mutate(weekfct=factor(week), yearint=as.integer(as.character(year)))
    
    # Show plots if requested
    if (showPlot) {
        # Plot of all deaths for the filtered cohort by week and year
        p1 <- allDeath %>%
            ggplot(aes(x=week, y=deaths, color=year, group=year)) + 
            geom_line() + 
            ylim(c(0, NA)) + 
            labs(x="Week", y="Deaths", title=plotTitle)
        print(p1)
        
        # Plot of mean and standard deviation from previous years
        p2 <- allDeath %>%
            filter(!(year %in% curYear)) %>%
            group_by(week) %>%
            summarize(mean=mean(deaths), sd=sd(deaths), .groups="drop") %>%
            ggplot(aes(x=week)) + 
            geom_line(aes(y=mean), color="red", size=2) + 
            geom_ribbon(aes(ymin=mean-sd, ymax=mean+sd), fill="pink", alpha=0.5) +
            labs(x="Week", 
                 y="Deaths", 
                 title=paste0(prevYears, " ", plotTitle), 
                 subtitle=paste0("Line is ", 
                                 prevYears, 
                                 " annual mean deaths, ribbon is +/- 1 standard deviation"
                 )
            ) + 
            ylim(c(0, NA))
        print(p2)
        
    }
    
    # Return the processed data frame
    allDeath
    
}



# Function to run the linear model to predict CDC deaths by week and year
cdcRegression <- function(df, 
                          curYear=2020:2021, 
                          startYear=2015
                          ) {
    
    # FUNCTION ARGUMENTS:
    # df: the data frame containing the processed CDC data
    # curYear: the current year(s)
    # startYear: the year that is defined as the zero-point for the regression
    
    # Very basic linear model for deaths vs. week and year
    df %>%
        filter(!(year %in% curYear)) %>%
        lm(deaths ~ weekfct + I(yearint-startYear) + 0, data=.)
    
}



# Function to make predictions based on the regression
cdcPrediction <- function(df, 
                          lmReg
                          ) {
    
    # FUNCTION ARGUMENTS:
    # df: the data frame or tibble containing the data
    # lmReg: the linear regression model
    
    # Predictions and deltas
    allPred <- df %>%
        mutate(oldweekfct=weekfct, weekfct=factor(ifelse(weekfct==53, 52, weekfct))) %>%
        mutate(pred=predict(lmReg, newdata=.), delta=(deaths-pred)) %>%
        group_by(year) %>%
        arrange(week) %>%
        mutate(cumDelta=cumsum(delta), cumPred=cumsum(pred), weekfct=oldweekfct) %>%
        select(-oldweekfct) %>%
        ungroup()
    
    # Return the predictions and deltas    
    allPred
    
}



# Function to make plots of the predictions
cdcPredictedvsActual <- function(df, 
                                 cohortName,
                                 curYear=2020:2021, 
                                 prevYears=paste0("2015-", min(curYear)-1), 
                                 startWeek=1, 
                                 showCurrentPredvsPrev=TRUE, 
                                 showActualsvsPred=TRUE,
                                 showActualsvsPredRatio=TRUE, 
                                 showExcessDeaths=TRUE, 
                                 predActualPlotsOnePage=FALSE
                                 ) {
    
    # FUNCTION ARGUMENTS:
    # df: the data frame containing actual and predicted deaths
    # cohortName: cohort name to use for plot titles
    # curYear: the current year
    # prevYears: the previous years
    # startWeek: week to use for starting the counts of cumulative excess deaths
    # showCurrentPredvsPrev: boolean, whether to plot current year predictions vs. previous years
    # showActualsvsPred: boolean, whether to plot actuals vs. predictions by year
    # showActualsvsPredRatio: boolean, whether to plot actuals vs. predictions ratios by yrea
    # showExcessDeaths: boolean, whether to plot estimated excess deaths
    # predActualPlotsOnePage: boolean, whether to place the predicted vs. actual plots all on one page
    
    if (showCurrentPredvsPrev) {
        # Current year predictions vs. previous year actuals
        p1 <- df %>%
            ggplot(aes(x=week)) + 
            geom_line(data=~filter(., !(year %in% curYear)), aes(y=deaths, color=year, group=year)) + 
            geom_line(data=~filter(., year %in% curYear), aes(y=pred, group=year, color=year), lty=2, size=1) +
            ylim(c(0, NA)) + 
            labs(x="Week", 
                 y="Deaths", 
                 title=paste0("All-cause deaths for ", cohortName, " cohort"), 
                 subtitle=paste0("Solid lines are actual ", 
                                 prevYears, 
                                 ", dashed line is predicted ", 
                                 if(length(curYear)>1) range(curYear) %>% paste0(collapse="-") else curYear
                 )
            )
        if (!predActualPlotsOnePage) print(p1)
    }
    
    if (showActualsvsPred) {
        # Plot of predictions and actuals facetted by year - deaths
        p2 <- df %>%
            select(year, week, deaths, pred) %>%
            pivot_longer(c(deaths, pred)) %>%
            mutate(name=ifelse(name=="deaths", "Actual", "Predicted")) %>%
            ggplot(aes(x=week)) + 
            geom_line(aes(y=value, color=name, group=name)) + 
            ylim(c(0, NA)) + 
            facet_wrap(~year) + 
            labs(x="Week", 
                 y="Deaths", 
                 title=paste0("All-cause deaths for ", cohortName, " cohort"), 
                 subtitle=paste0("Predictions based on simple linear model for ", prevYears, " data")
            ) + 
            scale_color_discrete("Deaths")
        if (!predActualPlotsOnePage) print(p2)
    }
    
    if (showActualsvsPredRatio) {
        # Plot of predictions and actuals facetted by year - ratios
        p3 <- df %>%
            mutate(rat=deaths/pred) %>%
            ggplot(aes(x=week)) + 
            geom_line(aes(y=rat)) + 
            facet_wrap(~year) + 
            geom_hline(yintercept=1, lty=2) +
            labs(x="Week", 
                 y="Ratio (Actual vs. Predicted Deaths)", 
                 title=paste0("Actual vs. predicted deaths for ", cohortName, " cohort"), 
                 subtitle=paste0("Predictions based on simple linear model for ", prevYears, " data")
            )
        if (!predActualPlotsOnePage) print(p3)
    }
    
    if (showExcessDeaths) {
        # Plot of excess deaths after a starting week
        p4 <- df %>%
            select(-cumDelta, -cumPred) %>%
            filter(week >= startWeek) %>%
            group_by(year) %>%
            arrange(week) %>%
            mutate(cumDelta=cumsum(delta)) %>%
            ggplot(aes(x=week)) + 
            geom_line(aes(y=cumDelta)) + 
            facet_wrap(~year) + 
            geom_hline(yintercept=0, lty=2) +
            labs(x="Week", 
                 y=paste0("Deaths vs. prediction (cumulative from week ", startWeek, ")"), 
                 title=paste0("Cumulative estimated excess deaths starting week ", 
                              startWeek, 
                              "\nfor ", 
                              cohortName, 
                              " cohort"
                 ), 
                 subtitle=paste0("Predictions based on simple linear model for ", prevYears, " data")
            )
        if (!predActualPlotsOnePage) print(p4)
    }
    
    # Create a one-page summary
    if (predActualPlotsOnePage) {
        if (!exists("p1", inherits=FALSE)) p1 <- ggplot()
        if (!exists("p2", inherits=FALSE)) p2 <- ggplot()
        if (!exists("p3", inherits=FALSE)) p3 <- ggplot()
        if (!exists("p4", inherits=FALSE)) p4 <- ggplot()
        gridExtra::grid.arrange(p1, p2, p3, p4, nrow=2)
    }
    
}



# Integrated function with all steps
cdcCohortAnalysis <- function(cohortName, 
                              df=cdcProcessed, 
                              critFilter=vector("list", 0), 
                              plotTitle=NULL, 
                              curYear=2020:2021, 
                              startYear=2015,
                              prevYears=paste0(startYear, "-", min(curYear)-1), 
                              startWeek=1, 
                              showSubsetPlots=TRUE, 
                              showPredActualPlots=TRUE, 
                              predActualPlotsOnePage=FALSE
                              ) {
    
    # FUNCTION ARGUMENTS:
    # cohortName: the name of the cohort
    # df: the processed CDC death data
    # critFilter: the filtering criteria in the form of a named list "variable"=c("possibleValues")
    # plotTitle: title to be used for plot in subsetCDC()
    # curYear: the current year(s) for the death data
    # startYear: the starting year for the death data
    # prevYears: the previous years for the death data (used for making predictions)
    # startWeek: the starting week for counting excess deaths
    # showSubsetPlots: boolean, whether to show the two plots in subsetCDC()
    # showPredActualPlots: boolean, whether to show the four plots in cdcPredictedvsActual()
    # predActualPlotsOnePage: boolean, whether to place the predicted vs. actual plots all on one page
    
    # Subset the data (will show both plots by default)
    allDeath_cohort <- subsetCDC(df=df, 
                                 critFilter=critFilter,
                                 showPlot=showSubsetPlots,
                                 plotTitle=plotTitle, 
                                 curYear=curYear,
                                 prevYears=prevYears
                                 )
    
    # Create the linear model for the subsetted data
    lm_cohort <- cdcRegression(df=allDeath_cohort, curYear=curYear, startYear=startYear)
    
    # Apply the predictions to the data
    allPred_cohort <- cdcPrediction(df=allDeath_cohort, lmReg=lm_cohort)
    
    # Plots for predicted vs. actual (will show all plots by default)
    cdcPredictedvsActual(df=allPred_cohort, 
                         cohortName=cohortName, 
                         curYear=curYear,
                         prevYears=prevYears,
                         startWeek=startWeek, 
                         showCurrentPredvsPrev=showPredActualPlots,
                         showActualsvsPred=showPredActualPlots,
                         showActualsvsPredRatio=showPredActualPlots,
                         showExcessDeaths=showPredActualPlots, 
                         predActualPlotsOnePage=predActualPlotsOnePage
    )
    
    # Return a list containing the lm model and the filtered data
    list(lmReg=lm_cohort, preds=allPred_cohort)
    
}



list_allUS_temp <- cdcCohortAnalysis(cohortName="all US", 
                                     df=cdc_temp %>% filter(!(state %in% c("NC"))),
                                     curYear=2020:2021, 
                                     startYear=2015,
                                     startWeek=1,
                                     plotTitle="All-cause US total deaths",
                                     predActualPlotsOnePage=TRUE
                                     )

list_allUS_temp
## $lmReg
## 
## Call:
## lm(formula = deaths ~ weekfct + I(yearint - startYear) + 0, data = .)
## 
## Coefficients:
##               weekfct1                weekfct2                weekfct3  
##                56785.0                 57133.8                 55731.8  
##               weekfct4                weekfct5                weekfct6  
##                54810.2                 54443.2                 54506.8  
##               weekfct7                weekfct8                weekfct9  
##                54138.2                 53497.6                 53296.2  
##              weekfct10               weekfct11               weekfct12  
##                53414.8                 52522.0                 52125.0  
##              weekfct13               weekfct14               weekfct15  
##                51404.8                 51357.6                 51011.0  
##              weekfct16               weekfct17               weekfct18  
##                49969.0                 49346.2                 49298.0  
##              weekfct19               weekfct20               weekfct21  
##                48537.8                 48023.8                 47864.8  
##              weekfct22               weekfct23               weekfct24  
##                47480.4                 47945.0                 47485.0  
##              weekfct25               weekfct26               weekfct27  
##                47381.6                 47239.2                 47446.8  
##              weekfct28               weekfct29               weekfct30  
##                47179.0                 46793.4                 46694.0  
##              weekfct31               weekfct32               weekfct33  
##                46772.4                 46904.0                 46694.6  
##              weekfct34               weekfct35               weekfct36  
##                46582.2                 46928.2                 47150.4  
##              weekfct37               weekfct38               weekfct39  
##                47418.6                 47437.4                 47544.6  
##              weekfct40               weekfct41               weekfct42  
##                48562.0                 48367.6                 49012.8  
##              weekfct43               weekfct44               weekfct45  
##                49172.0                 49497.6                 49792.8  
##              weekfct46               weekfct47               weekfct48  
##                50453.4                 50525.8                 50980.0  
##              weekfct49               weekfct50               weekfct51  
##                52046.6                 52429.8                 53166.0  
##              weekfct52  I(yearint - startYear)  
##                53725.2                   760.6  
## 
## 
## $preds
## # A tibble: 330 x 9
##    year   week deaths weekfct yearint   pred  delta cumDelta cumPred
##    <fct> <int>  <dbl> <fct>     <int>  <dbl>  <dbl>    <dbl>   <dbl>
##  1 2015      1  59578 1          2015 56785.  2793.    2793.  56785.
##  2 2016      1  53867 1          2016 57546. -3679.   -3679.  57546.
##  3 2017      1  57719 1          2017 58306.  -587.    -587.  58306.
##  4 2018      1  64008 1          2018 59067.  4941.    4941.  59067.
##  5 2019      1  56359 1          2019 59827. -3468.   -3468.  59827.
##  6 2020      1  58025 1          2020 60588. -2563.   -2563.  60588.
##  7 2021      1  84172 1          2021 61349. 22823.   22823.  61349.
##  8 2015      2  58974 2          2015 57134.  1840.    4633. 113919.
##  9 2016      2  53615 2          2016 57894. -4279.   -7958. 115440.
## 10 2017      2  58982 2          2017 58655    327.    -260. 116961.
## # ... with 320 more rows

The function cdcAggregateSummary() is updated:

# Function to create data and plots for an aggregate (cluster in this case)
cdcAggregateSummary <- function(df, 
                                critVar, 
                                critSubsets, 
                                startWeek,
                                idVarName=critVar,
                                curYear=2020:2021, 
                                startYear=2015,
                                subListNames=NULL,
                                critListNames=subListNames,
                                factorCritList=!is.null(critListNames),
                                popData=NULL,
                                cvDeathData=NULL,
                                showAllPlots=TRUE, 
                                showStep1Plots=showAllPlots,
                                showStep3Plots=showAllPlots
                                ) {
    
    # FUNCTION ARGUMENTS:
    # df: the data frame containing the processed CDC all-cause deaths data
    # critVar: the main variable supplied to critFilter
    # critSubsets: the subsets used for critVar
    # startWeek: the starting week for analysis (will assume no excess deaths prior to startWeek or curYear)
    # idVarName: name to be used for the identifying variable when excess deaths data are combined
    # curYear: current year(s) of CDC deaths data
    # subListNames: names to be used in place of numbers for idVarName column of excessFull
    #               (NULL means use numbers)
    # critListNames: names to be used for describing the key cohorts (NULL means use numbers)
    # factorCritList: boolean, whether to factor the variable that results from critListNames
    # startYear: starting year of CDC deaths data
    # popData: population data file, unique and matching by idVarName (NULL means no population plots)
    # cvDeathData: coronavirus deaths data file, unique and matching by idVarName
    #              NULL means no plots of excess all-cause deaths vs reported coronavirus deaths
    # showAllPlots: boolean, whether to show all the plots
    # showStep1Plots: boolean, whether to show a 1-page plot summary for each element of critSubsets
    # showStep3Plots: boolean, whether to show a 1-page cross-element summary
    
    # STEP 0: Initialize list to store results for each combination of critSubsets    
    tmpList <- vector("list", length(critSubsets))
    
    # STEP 1: Run the process for each combination of critSubsets (produce plots if showStep1Plots is TRUE)
    for (ctr in seq_along(critSubsets)) {
        critUse <- list(critSubsets[[ctr]])
        names(critUse) <- critVar
        useName <- if(is.null(critListNames)) paste0("Iteration: ", ctr) else critListNames[ctr]
        tmpList[[ctr]] <- cdcCohortAnalysis(cohortName=useName, 
                                            df=df, 
                                            critFilter=critUse, 
                                            plotTitle="",
                                            curYear=curYear,
                                            startYear=startYear,
                                            startWeek=startWeek, 
                                            showSubsetPlots=FALSE, 
                                            showPredActualPlots=showStep1Plots,
                                            predActualPlotsOnePage=showStep1Plots
        )
    }
    
    # STEP 1a: Add names if requeated
    if (!is.null(subListNames)) names(tmpList) <- subListNames
    
    # STEP 2: Create the excess file
    excessFull <- map_dfr(.x=tmpList, .f=function(x) { x[["preds"]] }, .id=idVarName)
    
    # STEP 2a: Convert idVarName to factor if critListName is passed
    if (factorCritList) {
        excessFull <- excessFull %>%
            mutate(!!idVarName:=factor(get(idVarName), levels=critListNames))
    }
    
    # STEP 3: Produce a plot of excess deaths by week by idVarName
    if (showStep3Plots) {
        p1 <- excessFull %>%
            filter(year %in% curYear) %>%
            ggplot(aes(x=week, y=delta)) + 
            geom_line(aes_string(group=idVarName, color=idVarName)) + 
            labs(title=paste0("Actual minus predicted deaths by week (", 
                              if(length(curYear)>1) range(curYear) %>% paste0(collapse="-") else curYear, 
                              ")"
                              ), 
                 x="Week", 
                 y="Actual minus predicted deaths"
            ) + 
            facet_wrap(~year) +
            scale_color_discrete(idVarName)
        print(p1)
    }
    
    # STEP 4: Augment the data to include population (if popData is not NULL) and plot "per million"
    if (!is.null(popData)) {
        # Add population data to excessFull (only variables should be idVarName and pop)
        excessFull <- excessFull %>%
            left_join(select_at(popData, vars(all_of(c(idVarName, "pop")))), by=idVarName)
        # Plot of excess deaths per million by week by cluster
        p2 <- excessFull %>%
            filter(year %in% curYear) %>%
            ggplot(aes(x=week, y=1000000*delta/pop)) + 
            geom_line(aes(group=cluster, color=cluster)) + 
            labs(title=paste0("Actual minus predicted deaths by week (", 
                              if(length(curYear)>1) range(curYear) %>% paste0(collapse="-") else curYear, 
                              ")"
                              ), 
                 subtitle="Per million people, per week",
                 x="Week", 
                 y="Actual minus predicted deaths (per million people)"
            ) + 
            scale_color_discrete(idVarName) + 
            facet_wrap(~year) +
            geom_hline(aes(yintercept=0), lty=2)
        print(p2)
    }
    
    # STEP 5: Augment the data to show plot of excess all-cause deaths vs. reported coronavirus deaths
    if (!is.null(cvDeathData)) {
        # Add the reported coronavirus deaths total
        excessFull <- excessFull %>%
            left_join(select_at(cvDeathData, vars(all_of(c(idVarName, "year", "week", "cvDeaths")))), 
                      by=c(idVarName, "yearint"="year", "week")
            ) %>%
            mutate(cvDeaths=ifelse(is.na(cvDeaths) | !(year %in% curYear), 0, cvDeaths))
        # Plot for total excess deaths and reported coronavirus deaths by idVarName by week
        p3 <- excessFull %>%
            filter(year %in% curYear) %>%
            mutate(useWeek=week+53*(yearint-min(curYear))) %>%
            select_at(vars(all_of(c(idVarName, "year", "week", "useWeek", "cvDeaths", "delta")))) %>%
            pivot_longer(c(cvDeaths, delta), names_to="source", values_to="deaths") %>%
            ggplot(aes(x=useWeek, y=deaths)) + 
            geom_line(aes(group=source, 
                          color=c("cvDeaths"="Reported COVID", "delta"="Excess All-Cause")[source]
                          )
                      ) + 
            facet_wrap(~get(idVarName)) + 
            scale_color_discrete("Deaths") + 
            labs(x="Week", 
                 y="Deaths", 
                 title=paste0("Deaths by ", idVarName, " by week"), 
                 subtitle=paste0("Week 1 is the first epi week of ", min(curYear))
                 )
        print(p3)
    }
    
    # STEP n: Return the excess file
    excessFull
    
}


clNames <- clusterList_temp$useClusters %>% unique()
clValues <- lapply(clNames, FUN=function(x) names(clusterList_temp$useClusters)[clusterList_temp$useClusters==x])
clusterAgg_temp <- cdcAggregateSummary(df=cdc_temp %>% filter(!(state %in% c("NC"))), 
                                       critVar="state", 
                                       critSubsets=clValues,
                                       startWeek=1, 
                                       subListNames=clNames,
                                       critListNames=paste0("cluster ", clNames),
                                       factorCritList=FALSE,
                                       popData=clusterList_temp$deaths %>% 
                                           group_by(cluster) %>% 
                                           summarize(pop=max(pop), .groups="drop"),
                                       cvDeathData=clusterList_temp$deaths %>% 
                                           select(cluster, year, week, cvDeaths=new_deaths),
                                       idVarName="cluster"
                                       )

stateAgg_temp <- cdcAggregateSummary(df=cdc_temp %>% filter(!(state %in% c("NC"))), 
                                     critVar="state", 
                                     critSubsets=setdiff(names(clusterList_temp$useClusters), "NC"),
                                     startWeek=1, 
                                     idVarName="state", 
                                     subListNames=setdiff(names(clusterList_temp$useClusters), "NC"),
                                     showAllPlots=FALSE
                                     )
stateAgg_temp
## # A tibble: 16,500 x 10
##    state year   week deaths weekfct yearint  pred   delta cumDelta cumPred
##    <fct> <fct> <int>  <dbl> <fct>     <int> <dbl>   <dbl>    <dbl>   <dbl>
##  1 AK    2015      1     62 1          2015  69.7  -7.71    -7.71     69.7
##  2 AK    2016      1     60 1          2016  71.4 -11.4    -11.4      71.4
##  3 AK    2017      1     85 1          2017  73    12.0     12.0      73  
##  4 AK    2018      1     71 1          2018  74.6  -3.65    -3.65     74.6
##  5 AK    2019      1     87 1          2019  76.3  10.7     10.7      76.3
##  6 AK    2020      1     77 1          2020  77.9  -0.938   -0.938    77.9
##  7 AK    2021      1    103 1          2021  79.6  23.4     23.4      79.6
##  8 AK    2015      2     77 2          2015  73.1   3.89    -3.82    143. 
##  9 AK    2016      2     65 2          2016  74.8  -9.75   -21.1     146. 
## 10 AK    2017      2     77 2          2017  76.4   0.600   12.6     149. 
## # ... with 16,490 more rows
ageAgg_temp <- cdcAggregateSummary(df=cdc_temp %>% filter(!(state %in% c("NC"))), 
                                   critVar="age", 
                                   critSubsets=levels(cdc_temp$age),
                                   startWeek=1, 
                                   idVarName="age", 
                                   subListNames=levels(cdc_temp$age),
                                   showAllPlots=TRUE
                                   )

ageAgg_temp
## # A tibble: 1,980 x 10
##    age           year   week deaths weekfct yearint  pred delta cumDelta cumPred
##    <fct>         <fct> <int>  <dbl> <fct>     <int> <dbl> <dbl>    <dbl>   <dbl>
##  1 Under 25 yea~ 2015      1   1027 1          2015 1099. -72.4    -72.4   1099.
##  2 Under 25 yea~ 2016      1   1027 1          2016 1079. -51.9    -51.9   1079.
##  3 Under 25 yea~ 2017      1   1107 1          2017 1058.  48.6     48.6   1058.
##  4 Under 25 yea~ 2018      1   1142 1          2018 1038. 104.     104.    1038.
##  5 Under 25 yea~ 2019      1    989 1          2019 1017. -28.4    -28.4   1017.
##  6 Under 25 yea~ 2020      1   1067 1          2020  997.  70.1     70.1    997.
##  7 Under 25 yea~ 2021      1    994 1          2021  976.  17.6     17.6    976.
##  8 Under 25 yea~ 2015      2   1050 2          2015 1090. -39.6   -112.    2189.
##  9 Under 25 yea~ 2016      2   1036 2          2016 1069. -33.1    -85.0   2148.
## 10 Under 25 yea~ 2017      2   1125 2          2017 1049.  76.4    125.    2107 
## # ... with 1,970 more rows

The function helperKeyStateExcessPlots() is updated:

# Create plots for state-level excess deaths
helperKeyStateExcessPlots <- function(df, 
                                      epiMonth,
                                      cvDeaths,
                                      subT,
                                      startWeek,
                                      cvDeathDate,
                                      curYear=2020:2021, 
                                      popData=select(usmap::statepop, state=abbr, pop=pop_2015)
                                      ) {
    
    # FUNCTION ARGUMENTS:
    # df: the data frame containing the processed state-level excess deaths data
    # epiMonth: a mapping file with columns ew-month-quarter that maps an epiweek to a month/quarter
    # cvDeaths: a file containing cv deaths that includes state-date-name-vpm
    # subT: the subtitle to be used for the plots 9describe the vintage of the CDC data)
    # startWeek: the starting week for the analysis (data from min(curYear) is deleted if week < startWeek)
    # cvDeathDate: the date to use for puling deaths from the cvDeaths file
    # curYear: the current year(s)
    # popData: a file containing only state-pop, with population by state
    
    # STEP 0: Create a general plotting database with quarter and month
    plotData <- df %>%
        filter(year %in% curYear) %>%
        left_join(epiMonth, by=c("week"="ew")) %>%
        mutate(quarter=factor(paste0(year, "-", "Q", quarter)), 
               postStart=ifelse(week>=startWeek | yearint>min(curYear), 1, 0)
               ) %>%
        group_by(state, yearint, quarter, month, postStart) %>%
        summarize(excess=sum(delta), .groups="drop") %>%
        left_join(popData, by="state") %>%
        mutate(excesspm=excess*1000000/pop)
    
    # STEP 1: Plot of excess deaths by quarter
    p1 <- plotData %>%
        group_by(state, quarter) %>%
        summarize(excess=sum(excess), .groups="drop") %>%
        ggplot(aes(x=fct_reorder(state, excess, .fun=sum), y=excess/1000, fill=fct_rev(quarter))) + 
        geom_col(position="stack") + 
        coord_flip() + 
        labs(x="State", 
             y="Excess Deaths (000s)", 
             title=paste0("All-cause excess deaths in ", 
                          if(length(curYear)==1) curYear else paste0(curYear, collapse="-")
                          ), 
             subtitle=subT
        ) +
        scale_fill_discrete("Quarter")
    print(p1)
    
    # STEP 2: Plot of excess deaths per million by state by quarter
    p2 <- plotData %>%
        group_by(state, quarter) %>%
        summarize(excesspm=sum(excesspm), .groups="drop") %>%
        ggplot(aes(x=fct_reorder(state, excesspm, .fun=sum), y=excesspm, fill=fct_rev(quarter))) + 
        geom_col(position="stack") + 
        coord_flip() + 
        labs(x="State", 
             y="Excess Deaths (per million people)", 
             title=paste0("All-cause excess deaths per million people in ", 
                          if(length(curYear)==1) curYear else paste0(curYear, collapse="-")
                          ), 
             subtitle=subT
        ) +
        scale_fill_discrete("Quarter")
    print(p2)
    
    # STEP 3: Plot of excess deaths by epi-month
    pes_001 <- plotData %>%
        filter(postStart==1) %>%
        ggplot(aes(x=fct_reorder(state, excesspm, .fun=sum))) + 
        geom_col(aes(y=excesspm, fill=fct_rev(paste0(yearint, "-", match(month, month.abb) %>% zeroPad2()))), 
                 position="stack"
                 ) + 
        coord_flip() + 
        labs(x="State", 
             y="Excess Deaths (per million people)", 
             title=paste0("All-cause excess deaths per million people in ", 
                          if(length(curYear)==1) curYear else paste0(curYear, collapse="-")
                          ), 
             subtitle=subT, 
             caption="Epi week converted to month based on most frequent month in epi week"
        ) +
        scale_fill_discrete("Epi Month")
    print(pes_001)
    
    # STEP 4: Layer on the reported coronavirus deaths data
    cumDeath <- cvDeaths %>% 
        filter(name=="deaths", state != "cluster") %>% 
        group_by(state) %>% 
        mutate(cumdpm=cumsum(vpm)) %>%
        filter(date %in% c(cvDeathDate, max(date)))
    pes_002 <- pes_001 + 
        geom_point(data=filter(cumDeath, date==cvDeathDate), aes(x=fct_reorder(state, cumdpm), y=cumdpm), size=2) + 
        labs(title=paste0("Deaths per million people using data through ", format(cvDeathDate, "%b %d, %Y")), 
             subtitle="Bars based on excess CDC all-cause deaths, points are reported coronavirus deaths", 
             y="Deaths (per million people)"
        )
    print(pes_002)
    
    # STEP 5: Show ratio of estimated all-cause deaths to reported coronavirus deaths
    p5 <- plotData %>%
        filter(postStart==1) %>%
        group_by(state) %>%
        summarize(excesspm=sum(excesspm)) %>%
        inner_join(select(filter(cumDeath, date==cvDeathDate), state, cumdpm)) %>%
        mutate(rat=excesspm/cumdpm) %>%
        ggplot(aes(x=fct_reorder(state, rat), y=rat)) + 
        geom_col(fill="lightblue") + 
        geom_text(aes(y=rat/2, label=paste0(round(100*rat), "%"))) + 
        coord_flip() + 
        labs(x="State", 
             y="Ratio of estimated all-cause excess deaths to reported coronavirus deaths", 
             title="Ratio of estimated all-cause excess deaths to reported coronavirus deaths", 
             subtitle=paste0("All data through ", format(cvDeathDate, "%b %d, %Y"))
        ) + 
        ylim(c(0, NA)) + 
        geom_hline(aes(yintercept=1), lty=2)
    print(p5)
    
    # STEP n: Return the plotting data
    plotData
    
}


weekThru <- 17
curYear <- 2020:2021
cvDeathThru <- "2021-05-26"
stateExcessPlot_temp <- helperKeyStateExcessPlots(df=stateAgg_temp, 
                                                  epiMonth=readFromRDS("epiMonth"),
                                                  cvDeaths=readFromRDS("cdc_daily_210528")$plotDataList$dfFull %>% 
                                                      select(state, date, vpm=dpm) %>% 
                                                      mutate(name="deaths") %>% 
                                                      filter(!is.na(vpm)),
                                                  startWeek=10,
                                                  cvDeathDate=as.Date(cvDeathThru),
                                                  subT=paste0("CDC data through week ", 
                                                              weekThru, 
                                                              " of ", 
                                                              max(curYear)
                                                              ), 
                                                  curYear=curYear, 
                                                  popData=readFromRDS("statePop2019") %>% 
                                                      select(state=stateAbb, pop=pop_2019)
                                                  )

## Joining, by = "state"

The function helperKeyAgeExcessPlots() is updated:

# Create plots for age-cohort excess deaths
helperKeyAgeExcessPlots <- function(df, 
                                    epiMonth,
                                    popData,
                                    subT,
                                    startWeek,
                                    curYear=2020:2021
                                    ) {
    
    # FUNCTION ARGUMENTS:
    # df: the data frame containing the processed state-level excess deaths data
    # epiMonth: a mapping file with columns ew-month-quarter that maps an epiweek to a month/quarter
    # popData: a file containing only age-pop, with population by age cohort
    # subT: the subtitle to be used for the plots 9describe the vintage of the CDC data)
    # startWeek: the starting week for the analysis (only data in year==curYear with week>=startWeek used)
    # curYear: the current year
    
    # STEP 0: Create a general plotting database with quarter and month
    plotData <- df %>%
        filter(year %in% curYear) %>%
        left_join(epiMonth, by=c("week"="ew")) %>%
        mutate(quarter=factor(paste0(year, "-", "Q", quarter)), 
               postStart=ifelse(week>=startWeek | yearint>min(curYear), 1, 0)
               ) %>%
        group_by(age, yearint, quarter, month, postStart) %>%
        summarize(excess=sum(delta), .groups="drop") %>%
        left_join(popData, by="age") %>%
        mutate(excesspm=excess*1000000/pop)
    
    # STEP 1: Plot of excess deaths by quarter
    p1 <- plotData %>%
        group_by(age, quarter) %>%
        summarize(excess=sum(excess), .groups="drop") %>%
        ggplot(aes(x=fct_reorder(age, excess, .fun=sum), y=excess/1000, fill=fct_rev(quarter))) + 
        geom_col(position="stack") + 
        coord_flip() + 
        labs(x="Age Group", 
             y="Excess Deaths (000s)", 
             title=paste0("All-cause excess deaths in ", 
                          if(length(curYear)==1) curYear else paste0(curYear, collapse="-")
                          ), 
             subtitle=subT
        ) +
        scale_fill_discrete("Quarter")
    print(p1)
    
    # STEP 2: Plot of excess deaths by epi-month
    pea_001 <- plotData %>%
        filter(postStart==1) %>%
        ggplot(aes(x=fct_reorder(age, excesspm, .fun=sum))) + 
        geom_col(aes(y=excesspm, fill=fct_rev(paste0(yearint, "-", match(month, month.abb) %>% zeroPad2()))), 
                 position="stack"
                 ) + 
        coord_flip() + 
        labs(x="Age Group", 
             y="Excess Deaths (per million people)", 
             title=paste0("All-cause excess deaths in ", 
                          if(length(curYear)==1) curYear else paste0(curYear, collapse="-")
                          ), 
             subtitle=subT, 
             caption="Epi week converted to month based on most frequent month in epi week"
        ) +
        geom_text(data=. %>% group_by(age) %>% summarize(excesspm=sum(excesspm)), 
                  aes(y=excesspm+100, label=round(excesspm), hjust=0)
        ) +
        scale_fill_discrete("Epi Month")
    print(pea_001)
    
    # STEP 3: Return the plotting data
    plotData
    
}

curYear <- 2020:2021
ageExcessPlot_temp <- helperKeyAgeExcessPlots(df=ageAgg_temp, 
                                              epiMonth=readFromRDS("epiMonth"),
                                              popData=readFromRDS("usPopBucket2020"),
                                              startWeek=10,
                                              subT=paste0("CDC data through week ", 
                                                          weekThru, 
                                                          " of ", 
                                                          max(curYear)
                                                          ), 
                                              curYear=curYear
                                              )

The function cdcAggregateSummary() is updtaed to allow for piping detailed summaries to pdf:

# Function to create data and plots for an aggregate (cluster in this case)
cdcAggregateSummary <- function(df, 
                                critVar, 
                                critSubsets, 
                                startWeek,
                                idVarName=critVar,
                                curYear=2020:2021, 
                                startYear=2015,
                                subListNames=NULL,
                                critListNames=subListNames,
                                factorCritList=!is.null(critListNames),
                                popData=NULL,
                                cvDeathData=NULL,
                                showAllPlots=TRUE, 
                                showStep1Plots=showAllPlots,
                                showStep3Plots=showAllPlots, 
                                pdfFile=NULL
                                ) {
    
    # FUNCTION ARGUMENTS:
    # df: the data frame containing the processed CDC all-cause deaths data
    # critVar: the main variable supplied to critFilter
    # critSubsets: the subsets used for critVar
    # startWeek: the starting week for analysis (will assume no excess deaths prior to startWeek or curYear)
    # idVarName: name to be used for the identifying variable when excess deaths data are combined
    # curYear: current year(s) of CDC deaths data
    # subListNames: names to be used in place of numbers for idVarName column of excessFull
    #               (NULL means use numbers)
    # critListNames: names to be used for describing the key cohorts (NULL means use numbers)
    # factorCritList: boolean, whether to factor the variable that results from critListNames
    # startYear: starting year of CDC deaths data
    # popData: population data file, unique and matching by idVarName (NULL means no population plots)
    # cvDeathData: coronavirus deaths data file, unique and matching by idVarName
    #              NULL means no plots of excess all-cause deaths vs reported coronavirus deaths
    # showAllPlots: boolean, whether to show all the plots
    # showStep1Plots: boolean, whether to show a 1-page plot summary for each element of critSubsets
    # showStep3Plots: boolean, whether to show a 1-page cross-element summary
    # pdfFile: a path for the detailed step 1 plots to be stored as PDF (NULL means store in the log)
    
    # STEP 0: Initialize list to store results for each combination of critSubsets    
    tmpList <- vector("list", length(critSubsets))
    
    # STEP 1: Run the process for each combination of critSubsets (produce plots if showStep1Plots is TRUE)
    # Redirect detailed plots to PDF if requested
    if (!is.null(pdfFile)) {
        cat("\nDetailed", idVarName, "summary PDF file is available at:", pdfFile, "\n")
        pdf(pdfFile, width=11)
    }
    # Run the detailed sumaries for each level
    for (ctr in seq_along(critSubsets)) {
        critUse <- list(critSubsets[[ctr]])
        names(critUse) <- critVar
        useName <- if(is.null(critListNames)) paste0("Iteration: ", ctr) else critListNames[ctr]
        tmpList[[ctr]] <- cdcCohortAnalysis(cohortName=useName, 
                                            df=df, 
                                            critFilter=critUse, 
                                            plotTitle="",
                                            curYear=curYear,
                                            startYear=startYear,
                                            startWeek=startWeek, 
                                            showSubsetPlots=FALSE, 
                                            showPredActualPlots=showStep1Plots,
                                            predActualPlotsOnePage=showStep1Plots
        )
    }
    # Redirect all other output to the main log
    if (!is.null(pdfFile)) {
        cat("\nReturning plot outputs to the main log file\n")
        dev.off()
    }
    
    # STEP 1a: Add names if requeated
    if (!is.null(subListNames)) names(tmpList) <- subListNames
    
    # STEP 2: Create the excess file
    excessFull <- map_dfr(.x=tmpList, .f=function(x) { x[["preds"]] }, .id=idVarName)
    
    # STEP 2a: Convert idVarName to factor if critListName is passed
    if (factorCritList) {
        excessFull <- excessFull %>%
            mutate(!!idVarName:=factor(get(idVarName), levels=critListNames))
    }
    
    # STEP 3: Produce a plot of excess deaths by week by idVarName
    if (showStep3Plots) {
        p1 <- excessFull %>%
            filter(year %in% curYear) %>%
            ggplot(aes(x=week, y=delta)) + 
            geom_line(aes_string(group=idVarName, color=idVarName)) + 
            labs(title=paste0("Actual minus predicted deaths by week (", 
                              if(length(curYear)>1) range(curYear) %>% paste0(collapse="-") else curYear, 
                              ")"
                              ), 
                 x="Week", 
                 y="Actual minus predicted deaths"
            ) + 
            facet_wrap(~year) +
            scale_color_discrete(idVarName)
        print(p1)
    }
    
    # STEP 4: Augment the data to include population (if popData is not NULL) and plot "per million"
    if (!is.null(popData)) {
        # Add population data to excessFull (only variables should be idVarName and pop)
        excessFull <- excessFull %>%
            left_join(select_at(popData, vars(all_of(c(idVarName, "pop")))), by=idVarName)
        # Plot of excess deaths per million by week by cluster
        p2 <- excessFull %>%
            filter(year %in% curYear) %>%
            ggplot(aes(x=week, y=1000000*delta/pop)) + 
            geom_line(aes(group=cluster, color=cluster)) + 
            labs(title=paste0("Actual minus predicted deaths by week (", 
                              if(length(curYear)>1) range(curYear) %>% paste0(collapse="-") else curYear, 
                              ")"
                              ), 
                 subtitle="Per million people, per week",
                 x="Week", 
                 y="Actual minus predicted deaths (per million people)"
            ) + 
            scale_color_discrete(idVarName) + 
            facet_wrap(~year) +
            geom_hline(aes(yintercept=0), lty=2)
        print(p2)
    }
    
    # STEP 5: Augment the data to show plot of excess all-cause deaths vs. reported coronavirus deaths
    if (!is.null(cvDeathData)) {
        # Add the reported coronavirus deaths total
        excessFull <- excessFull %>%
            left_join(select_at(cvDeathData, vars(all_of(c(idVarName, "year", "week", "cvDeaths")))), 
                      by=c(idVarName, "yearint"="year", "week")
            ) %>%
            mutate(cvDeaths=ifelse(is.na(cvDeaths) | !(year %in% curYear), 0, cvDeaths))
        # Plot for total excess deaths and reported coronavirus deaths by idVarName by week
        p3 <- excessFull %>%
            filter(year %in% curYear) %>%
            mutate(useWeek=week+53*(yearint-min(curYear))) %>%
            select_at(vars(all_of(c(idVarName, "year", "week", "useWeek", "cvDeaths", "delta")))) %>%
            pivot_longer(c(cvDeaths, delta), names_to="source", values_to="deaths") %>%
            ggplot(aes(x=useWeek, y=deaths)) + 
            geom_line(aes(group=source, 
                          color=c("cvDeaths"="Reported COVID", "delta"="Excess All-Cause")[source]
                          )
                      ) + 
            facet_wrap(~get(idVarName)) + 
            scale_color_discrete("Deaths") + 
            labs(x="Week", 
                 y="Deaths", 
                 title=paste0("Deaths by ", idVarName, " by week"), 
                 subtitle=paste0("Week 1 is the first epi week of ", min(curYear))
                 )
        print(p3)
    }
    
    # STEP n: Return the excess file
    excessFull
    
}


clNames <- clusterList_temp$useClusters %>% unique()
clValues <- lapply(clNames, FUN=function(x) names(clusterList_temp$useClusters)[clusterList_temp$useClusters==x])
clusterAgg_temp <- cdcAggregateSummary(df=cdc_temp %>% filter(!(state %in% c("NC"))), 
                                       critVar="state", 
                                       critSubsets=clValues,
                                       startWeek=1, 
                                       subListNames=clNames,
                                       critListNames=paste0("cluster ", clNames),
                                       factorCritList=FALSE,
                                       popData=clusterList_temp$deaths %>% 
                                           group_by(cluster) %>% 
                                           summarize(pop=max(pop), .groups="drop"),
                                       cvDeathData=clusterList_temp$deaths %>% 
                                           select(cluster, year, week, cvDeaths=new_deaths),
                                       idVarName="cluster", 
                                       showAllPlots=TRUE, 
                                       pdfFile="./RInputFiles/Coronavirus/Plots/CDC_cluster_2021w17.pdf"
                                       )
## 
## Detailed cluster summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_cluster_2021w17.pdf
## 
## Returning plot outputs to the main log file

stateAgg_temp <- cdcAggregateSummary(df=cdc_temp %>% filter(!(state %in% c("NC"))), 
                                     critVar="state", 
                                     critSubsets=setdiff(names(clusterList_temp$useClusters), "NC"),
                                     startWeek=1, 
                                     idVarName="state", 
                                     subListNames=setdiff(names(clusterList_temp$useClusters), "NC"),
                                     showAllPlots=FALSE
                                     )
stateAgg_temp
## # A tibble: 16,500 x 10
##    state year   week deaths weekfct yearint  pred   delta cumDelta cumPred
##    <fct> <fct> <int>  <dbl> <fct>     <int> <dbl>   <dbl>    <dbl>   <dbl>
##  1 AK    2015      1     62 1          2015  69.7  -7.71    -7.71     69.7
##  2 AK    2016      1     60 1          2016  71.4 -11.4    -11.4      71.4
##  3 AK    2017      1     85 1          2017  73    12.0     12.0      73  
##  4 AK    2018      1     71 1          2018  74.6  -3.65    -3.65     74.6
##  5 AK    2019      1     87 1          2019  76.3  10.7     10.7      76.3
##  6 AK    2020      1     77 1          2020  77.9  -0.938   -0.938    77.9
##  7 AK    2021      1    103 1          2021  79.6  23.4     23.4      79.6
##  8 AK    2015      2     77 2          2015  73.1   3.89    -3.82    143. 
##  9 AK    2016      2     65 2          2016  74.8  -9.75   -21.1     146. 
## 10 AK    2017      2     77 2          2017  76.4   0.600   12.6     149. 
## # ... with 16,490 more rows
ageAgg_temp <- cdcAggregateSummary(df=cdc_temp %>% filter(!(state %in% c("NC"))), 
                                   critVar="age", 
                                   critSubsets=levels(cdc_temp$age),
                                   startWeek=1, 
                                   idVarName="age", 
                                   subListNames=levels(cdc_temp$age),
                                   showAllPlots=TRUE, 
                                   pdfFile="./RInputFiles/Coronavirus/Plots/CDC_age_2021w17.pdf"
                                   )
## 
## Detailed age summary PDF file is available at: ./RInputFiles/Coronavirus/Plots/CDC_age_2021w17.pdf
## 
## Returning plot outputs to the main log file

ageAgg_temp
## # A tibble: 1,980 x 10
##    age           year   week deaths weekfct yearint  pred delta cumDelta cumPred
##    <fct>         <fct> <int>  <dbl> <fct>     <int> <dbl> <dbl>    <dbl>   <dbl>
##  1 Under 25 yea~ 2015      1   1027 1          2015 1099. -72.4    -72.4   1099.
##  2 Under 25 yea~ 2016      1   1027 1          2016 1079. -51.9    -51.9   1079.
##  3 Under 25 yea~ 2017      1   1107 1          2017 1058.  48.6     48.6   1058.
##  4 Under 25 yea~ 2018      1   1142 1          2018 1038. 104.     104.    1038.
##  5 Under 25 yea~ 2019      1    989 1          2019 1017. -28.4    -28.4   1017.
##  6 Under 25 yea~ 2020      1   1067 1          2020  997.  70.1     70.1    997.
##  7 Under 25 yea~ 2021      1    994 1          2021  976.  17.6     17.6    976.
##  8 Under 25 yea~ 2015      2   1050 2          2015 1090. -39.6   -112.    2189.
##  9 Under 25 yea~ 2016      2   1036 2          2016 1069. -33.1    -85.0   2148.
## 10 Under 25 yea~ 2017      2   1125 2          2017 1049.  76.4    125.    2107 
## # ... with 1,970 more rows

Next steps are to integrate the functions in readRunCDCAllCause()